Instructions

R markdown is a plain-text file format for integrating text and R code, and creating transparent, reproducible and interactive reports. An R markdown file (.Rmd) contains metadata, markdown and R code “chunks”, and can be “knit” into numerous output types. Answer the test questions by adding R code to the fenced code areas below each item. Some items also include questions that require a written answer. Enter your comments in the space provided as shown below:

Answer: (Enter your answer here.)

Once completed, you will “knit” and submit the resulting .html document and the .Rmd file. The .html will present the output of your R code and your written answers, but your R code will not appear. Your R code will appear in the .Rmd file. The resulting .html document will be graded and a feedback report returned with comments. Points assigned to each item appear in the template.

Before proceeding, look to the top of the .Rmd for the (YAML) metadata block, where the title, author and output are given. Please change author to include your name, with the format ‘lastName, firstName.’

If you encounter issues with knitting the .html, please send an email via Canvas to your TA.

Each code chunk is delineated by six (6) backticks; three (3) at the start and three (3) at the end. After the opening ticks, arguments are passed to the code chunk in curly brackets. Please do not add or remove backticks, or modify the arguments or values inside the curly brackets. An example code chunk is included here:

# Comments are included in each code chunk, simply as prompts

#...R code placed here

#...R code placed here

R code only needs to be added inside the code chunks for each assignment item. However, there are questions that follow many assignment items. Enter your answers in the space provided. An example showing how to use the template and respond to a question follows.


Example Problem with Solution:

Use rbinom() to generate two random samples of size 10,000 from the binomial distribution. For the first sample, use p = 0.45 and n = 10. For the second sample, use p = 0.55 and n = 10. Convert the sample frequencies to sample proportions and compute the mean number of successes for each sample. Present these statistics.

set.seed(123)
sample.one <- table(rbinom(10000, 10, 0.45)) / 10000
sample.two <- table(rbinom(10000, 10, 0.55)) / 10000

successes <- seq(0, 10)

round(sum(sample.one*successes), digits = 1) # [1] 4.5
## [1] 4.5
round(sum(sample.two*successes), digits = 1) # [1] 5.5
## [1] 5.5

Question: How do the simulated expectations compare to calculated binomial expectations?

Answer: The calculated binomial expectations are 10(0.45) = 4.5 and 10(0.55) = 5.5. After rounding the simulated results, the same values are obtained.


Submit both the .Rmd and .html files for grading. You may remove the instructions and example problem above, but do not remove the YAML metadata block or the first, “setup” code chunk. Address the steps that appear below and answer all the questions. Be sure to address each question with code and comments as needed. You may use either base R functions or ggplot2 for the visualizations.


##Data Analysis #2

## 'data.frame':    1036 obs. of  10 variables:
##  $ SEX   : Factor w/ 3 levels "F","I","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ LENGTH: num  5.57 3.67 10.08 4.09 6.93 ...
##  $ DIAM  : num  4.09 2.62 7.35 3.15 4.83 ...
##  $ HEIGHT: num  1.26 0.84 2.205 0.945 1.785 ...
##  $ WHOLE : num  11.5 3.5 79.38 4.69 21.19 ...
##  $ SHUCK : num  4.31 1.19 44 2.25 9.88 ...
##  $ RINGS : int  6 4 6 3 6 6 5 6 5 6 ...
##  $ CLASS : Factor w/ 5 levels "A1","A2","A3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ VOLUME: num  28.7 8.1 163.4 12.2 59.7 ...
##  $ RATIO : num  0.15 0.147 0.269 0.185 0.165 ...

Test items start here. There are 10 sections, for a total of 75 points.

#### Section 1: (5 points) ####

(1)(a) Form a histogram and QQ plot using RATIO. Calculate skewness and kurtosis using ‘rockchalk.’ Be aware that with ‘rockchalk’, the kurtosis value has 3.0 subtracted from it which differs from the ‘moments’ package.

## Skewness: 0.7147056
## Kurtosis (Adjusted): 1.667298

(1)(b) Transform RATIO using log10() to create L_RATIO (Kabacoff Section 8.5.2, p. 199-200). Form a histogram and QQ plot using L_RATIO. Calculate the skewness and kurtosis. Create a boxplot of L_RATIO differentiated by CLASS.

## Skewness (L_RATIO): -0.09391548
## Kurtosis (Adjusted, L_RATIO): 0.5354309
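The two kurtosis conventions mentioned in (1)(a), and the effect of the log10 transform in (1)(b), can be checked with a minimal base-R sketch. The vector `x` is a synthetic lognormal stand-in for RATIO (not the abalone data) and the helper names are hypothetical. The helpers use plain moment ratios, so any small-sample corrections applied by a package may make its output differ slightly.

```r
set.seed(1)
x <- rlnorm(1000, meanlog = -2, sdlog = 0.3)  # synthetic, right-skewed stand-in for RATIO

central_moment <- function(v, k) mean((v - mean(v))^k)
skew        <- function(v) central_moment(v, 3) / central_moment(v, 2)^1.5
kurt        <- function(v) central_moment(v, 4) / central_moment(v, 2)^2  # 3 for a normal
excess_kurt <- function(v) kurt(v) - 3                                     # 0 for a normal

stats <- rbind(
  raw   = c(skewness = skew(x),        kurtosis = kurt(x),        excess = excess_kurt(x)),
  log10 = c(skewness = skew(log10(x)), kurtosis = kurt(log10(x)), excess = excess_kurt(log10(x)))
)
print(round(stats, 3))
```

The `excess` column corresponds to the "3.0 subtracted" convention, and the `kurtosis` column to the unadjusted one.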

(1)(c) Test the homogeneity of variance across classes using bartlett.test() (Kabacoff Section 9.2.2, p. 222).

## 
##  Bartlett test of homogeneity of variances
## 
## data:  L_RATIO by CLASS
## Bartlett's K-squared = 3.1891, df = 4, p-value = 0.5267
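As a rough illustration of the call used in (1)(c), the sketch below runs `bartlett.test()` on a synthetic data frame. The data frame `toy` and its values are made up for this example; only the formula form `L_RATIO ~ CLASS` mirrors the output above.

```r
set.seed(42)
toy <- data.frame(
  CLASS   = factor(rep(paste0("A", 1:5), each = 80)),
  # Group means differ, but every group is drawn with the same standard deviation
  L_RATIO = rnorm(400, mean = rep(seq(-0.80, -0.90, length.out = 5), each = 80), sd = 0.08)
)

bt <- bartlett.test(L_RATIO ~ CLASS, data = toy)
print(bt)

# The per-class variances that the test compares
print(round(tapply(toy$L_RATIO, toy$CLASS, var), 5))
```

With five classes the test has 4 degrees of freedom, matching the `df = 4` reported above.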

Essay Question: Based on steps 1.a, 1.b and 1.c, which variable RATIO or L_RATIO exhibits better conformance to a normal distribution with homogeneous variances across age classes? Why?

***Answer: (Based on steps 1.a, 1.b and 1.c, L_RATIO (the log-transformed RATIO) conforms better to a normal distribution with homogeneous variances across age classes, for the following reasons.

1.a: RATIO

Histogram: Right-skewed, not symmetric. QQ Plot: Data points deviate noticeably from the normal line, especially at the tails. Skewness: 0.71 → Indicates moderate positive skew. Adjusted Kurtosis: 1.67 → Indicates heavier tails than a normal distribution.

RATIO therefore departs from a normal distribution and is not ideal for parametric statistical tests.

1.b: Transformed variable, L_RATIO

Histogram: More symmetric and bell-shaped. QQ Plot: Data points align well with the normal line, even in the tails. Skewness: -0.094 → Nearly symmetrical. Adjusted Kurtosis: 0.535 → Close to that of a normal distribution (which is 0 when adjusted).

The log transformation therefore markedly improves the distribution, so that L_RATIO approximates a normal distribution.

1.c: Homogeneity of variance

Bartlett’s K² = 3.1891, p-value = 0.5267. Since p > 0.05, we fail to reject the null hypothesis of equal variances, so there is no evidence that the variance of L_RATIO differs across age classes. Boxplots of L_RATIO by CLASS are consistent with this, showing similar spread and symmetry.

So, L_RATIO is the better choice for statistical analysis.

)***

#### Section 2 (10 points) ####

(2)(a) Perform an analysis of variance with aov() on L_RATIO using CLASS and SEX as the independent variables (Kabacoff chapter 9, p. 212-229). Assume equal variances. Perform two analyses. First, fit a model with the interaction term CLASS:SEX. Then, fit a model without CLASS:SEX. Use summary() to obtain the analysis of variance tables (Kabacoff chapter 9, p. 227).

##               Df Sum Sq Mean Sq F value  Pr(>F)    
## CLASS          4  1.055 0.26384  38.370 < 2e-16 ***
## SEX            2  0.091 0.04569   6.644 0.00136 ** 
## CLASS:SEX      8  0.027 0.00334   0.485 0.86709    
## Residuals   1021  7.021 0.00688                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##               Df Sum Sq Mean Sq F value  Pr(>F)    
## CLASS          4  1.055 0.26384  38.524 < 2e-16 ***
## SEX            2  0.091 0.04569   6.671 0.00132 ** 
## Residuals   1029  7.047 0.00685                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Essay Question: Compare the two analyses. What does the non-significant interaction term suggest about the relationship between L_RATIO and the factors CLASS and SEX?

Answer: ( The comparison between the two ANOVA models - one with an interaction term (CLASS:SEX) and one without reveals that the interaction between CLASS and SEX is not statistically significant (p = 0.867). This suggests that the effect of CLASS on the log-transformed ratio (L_RATIO) does not depend on the SEX of the abalones, and vice versa. In other words, the relationship between L_RATIO and CLASS is consistent across different SEX categories, indicating no meaningful combined or multiplicative effect. Both CLASS and SEX individually show statistically significant effects on L_RATIO, with CLASS having a very strong influence (p < 2e-16) and SEX also showing a significant but smaller effect (p ~ 0.0013). Since the interaction term does not contribute significantly to explaining the variability in L_RATIO, the simpler model without the interaction is preferable. It is more parsimonious and equally effective in capturing the main effects, allowing for a clearer interpretation of how CLASS and SEX independently influence L_RATIO. )
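The comparison of the two fits can be sketched as follows. The data frame `toy` and its effect sizes are invented for illustration (no interaction is built in), and `fit_int`, `fit_add` and `cmp` are hypothetical names. The nested-model comparison with `anova()` is one way to see that the interaction row of the first table is a test of the larger model against the smaller one.

```r
set.seed(7)
n <- 600
toy <- data.frame(
  CLASS = factor(sample(paste0("A", 1:5), n, replace = TRUE)),
  SEX   = factor(sample(c("F", "I", "M"), n, replace = TRUE))
)
# Made-up additive effects: the synthetic response has no CLASS:SEX interaction
class_eff <- c(A1 = 0, A2 = -0.01, A3 = -0.03, A4 = -0.06, A5 = -0.10)
sex_eff   <- c(F = 0, I = -0.02, M = 0)
toy$L_RATIO <- -0.80 +
  unname(class_eff[as.character(toy$CLASS)]) +
  unname(sex_eff[as.character(toy$SEX)]) +
  rnorm(n, sd = 0.08)

fit_int <- aov(L_RATIO ~ CLASS + SEX + CLASS:SEX, data = toy)
fit_add <- aov(L_RATIO ~ CLASS + SEX, data = toy)
print(summary(fit_int))
print(summary(fit_add))

# F test of the interaction as a comparison of nested models
cmp <- anova(fit_add, fit_int)
print(cmp)
```

The F statistic in `cmp` is the same as the one in the CLASS:SEX row of `summary(fit_int)`, and the 8 degrees of freedom it uses are the ones returned to the residuals when the term is dropped.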

(2)(b) For the model without CLASS:SEX (i.e. an interaction term), obtain multiple comparisons with the TukeyHSD() function. Interpret the results at the 95% confidence level (TukeyHSD() will adjust for unequal sample sizes).

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = L_RATIO ~ CLASS + SEX, data = mydata)
## 
## $CLASS
##              diff         lwr          upr     p adj
## A2-A1 -0.01248831 -0.03876038  0.013783756 0.6919456
## A3-A1 -0.03426008 -0.05933928 -0.009180867 0.0018630
## A4-A1 -0.05863763 -0.08594237 -0.031332896 0.0000001
## A5-A1 -0.09997200 -0.12764430 -0.072299703 0.0000000
## A3-A2 -0.02177176 -0.04106269 -0.002480831 0.0178413
## A4-A2 -0.04614932 -0.06825638 -0.024042262 0.0000002
## A5-A2 -0.08748369 -0.11004316 -0.064924223 0.0000000
## A4-A3 -0.02437756 -0.04505283 -0.003702280 0.0114638
## A5-A3 -0.06571193 -0.08687025 -0.044553605 0.0000000
## A5-A4 -0.04133437 -0.06508845 -0.017580286 0.0000223
## 
## $SEX
##             diff          lwr           upr     p adj
## I-F -0.015890329 -0.031069561 -0.0007110968 0.0376673
## M-F  0.002069057 -0.012585555  0.0167236691 0.9412689
## M-I  0.017959386  0.003340824  0.0325779478 0.0111881

Additional Essay Question: first, interpret the trend in coefficients across age classes. What is this indicating about L_RATIO? Second, do these results suggest male and female abalones can be combined into a single category labeled as ‘adults?’ If not, why not?

***Answer: ( The pairwise comparisons among age classes (A1 through A5) show a consistent downward trend in L_RATIO as the age class increases. A2 vs A1: Not significantly different (p = 0.69). A3, A4, A5 vs A1: All significantly lower L_RATIO values, with the difference growing in magnitude. A5 vs A1: Shows the largest negative difference (-0.09997), highly significant (p < 0.0001). Every consecutive difference from A2 onward (A3–A2, A4–A3, A5–A4) is statistically significant at the 0.05 level (p = 0.018, 0.011 and < 0.0001, respectively).

This trend suggests that L_RATIO decreases as age increases. In other words, there is a systematic reduction in L_RATIO with advancing maturity, indicating that age class is a strong and significant predictor of L_RATIO. The decline appears to be gradual but consistent across adjacent age groups.

Looking at the SEX comparisons: Immature vs Female (I–F): Significant difference in L_RATIO (p = 0.0377), with immature having lower L_RATIO. Male vs Female (M–F): No significant difference (p = 0.9413), suggesting L_RATIO values are statistically similar. Male vs Immature (M–I): Significant difference (p = 0.0112), with males having higher L_RATIO than immatures.

Since females and males are not significantly different in L_RATIO, it is reasonable to group them together under a common “adult” category from a statistical standpoint. However, immature abalones differ significantly from both adults (male and female), especially showing lower L_RATIO values.

)***
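A minimal sketch of how the Tukey output can be turned into a significance flag per pair is given below. The data are synthetic, and the names `toy`, `tk` and `sex_cmp` are assumptions made for this example; only `TukeyHSD()` on an additive `aov()` fit mirrors the step above.

```r
set.seed(7)
n <- 600
toy <- data.frame(
  CLASS = factor(sample(paste0("A", 1:5), n, replace = TRUE)),
  SEX   = factor(sample(c("F", "I", "M"), n, replace = TRUE))
)
# Made-up effects: infants sit slightly lower, males and females are identical
class_eff <- c(A1 = 0, A2 = -0.01, A3 = -0.03, A4 = -0.06, A5 = -0.10)
sex_eff   <- c(F = 0, I = -0.02, M = 0)
toy$L_RATIO <- -0.80 +
  unname(class_eff[as.character(toy$CLASS)]) +
  unname(sex_eff[as.character(toy$SEX)]) +
  rnorm(n, sd = 0.08)

fit <- aov(L_RATIO ~ CLASS + SEX, data = toy)
tk  <- TukeyHSD(fit, conf.level = 0.95)

# One row per pair of SEX levels; flag pairs whose adjusted p-value is below 0.05
sex_cmp <- as.data.frame(tk$SEX)
sex_cmp$significant <- sex_cmp[["p adj"]] < 0.05
print(sex_cmp)
```

A pair is flagged exactly when its 95% interval (`lwr`, `upr`) excludes zero, which is the same reading applied to the M-F, I-F and M-I rows above.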

#### Section 3: (10 points) ####

(3)(a1) Here, we will combine “M” and “F” into a new level, “ADULT”. The code for doing this is given to you. For (3)(a1), all you need to do is execute the code as given.

## 
## ADULT     I 
##   707   329
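The code supplied for this step is not reproduced in this report. One way such a recoding could be written, shown here on a small toy vector rather than the abalone data, is:

```r
SEX  <- factor(c("F", "I", "M", "I", "M", "F", "I"))  # toy vector, not the abalone data

# Everything that is not an infant ("I") is labelled "ADULT"
TYPE <- factor(ifelse(SEX == "I", "I", "ADULT"))

print(table(SEX, TYPE))
```

This is only a sketch of the idea; the template's own code may differ in detail.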

(3)(a2) Present side-by-side histograms of VOLUME. One should display infant volumes and, the other, adult volumes.

Essay Question: Compare the histograms. How do the distributions differ? Are there going to be any difficulties separating infants from adults based on VOLUME?

***Answer: (Comparisons of the histogram based on the distribution shape: Infants Volume: The distribution is heavily right-skewed. Most infant abalones have lower volume values, mainly below 300. A steep drop-off occurs as volume increases, with very few infants beyond 400.

Adults Volume: The distribution is roughly symmetric, resembling a normal distribution. Most adults have moderate volumes centered around 400–600. Very few adults have volume less than 200.

There’s a clear distinction in the volume ranges: Infants: Mostly < 300. Adults: Mostly > 300, centered around 400–600. However, there is some overlap around the 200-400 range, where both infants and smaller adults exist.

Since the two distributions are mostly separated with different shapes and centers, VOLUME is a useful feature for distinguishing infants from adults. In the overlap region (~200–400), distinguishing infants from small adults could be tricky, and VOLUME alone may not be perfectly accurate. A classification model using just VOLUME might make errors in this zone. )***

(3)(b) Create a scatterplot of SHUCK versus VOLUME and a scatterplot of their base ten logarithms, labeling the variables as L_SHUCK and L_VOLUME. Please be aware the variables, L_SHUCK and L_VOLUME, present the data as orders of magnitude (i.e. VOLUME = 100 = 10^2 becomes L_VOLUME = 2). Use color to differentiate CLASS in the plots. Repeat using color to differentiate by TYPE.

Additional Essay Question: Compare the two scatterplots. What effect(s) does log-transformation appear to have on the variability present in the plot? What are the implications for linear regression analysis? Where do the various CLASS levels appear in the plots? Where do the levels of TYPE appear in the plots?

***Answer: (The scatterplots illustrate the relationship between SHUCK weight and VOLUME, first on the original scale and then on the log-transformed scale. In the original (untransformed) plots, the data exhibit a non-linear, fan-shaped pattern, where the spread of SHUCK weight increases with larger VOLUME values. This unequal variability can violate assumptions of linear regression and lead to misleading results.

After applying a log10 transformation to both SHUCK and VOLUME, the relationship becomes much more linear and the variability becomes more constant across the range of values. The data points form a tighter, more uniform band around the linear trend. This transformation also reduces the impact of extreme values (outliers), making the relationship easier to model accurately with linear regression. Overall, the log transformation brings the data closer to satisfying the assumptions of linearity, normality, and equal variance, which are all critical for reliable regression analysis.

In both the original and log-transformed plots colored by CLASS, we can see a clear progression: • CLASS A1 and A2 (younger age classes) tend to appear in the lower-left region of the plots, corresponding to smaller VOLUME and SHUCK values. • CLASS A3, A4, and especially A5 (older age classes) appear toward the upper-right, reflecting larger abalones with greater volume and meat (SHUCK weight). This suggests that as CLASS increases (abalones get older), both SHUCK and VOLUME increase in a consistent, growth-related pattern.

In the plots colored by TYPE: • Type “I” (presumably immature) are clustered in the lower-left corner, representing smaller abalones with lower SHUCK and VOLUME. • Type “ADULT” are spread across a wider range and dominate the upper part of the plots, especially at higher VOLUME and SHUCK values.

This indicates that most large abalones are adults, while immature ones are smaller and more tightly grouped in the size distribution.

)***
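The variance-stabilising effect described above can be illustrated with a small simulation. The sketch assumes a multiplicative error model (SHUCK proportional to VOLUME times a lognormal error), which is an assumption of this example and not something established by the report; all names and constants are made up.

```r
set.seed(11)
VOLUME <- runif(500, min = 10, max = 1000)
# Multiplicative noise: the absolute scatter grows with VOLUME (fan shape)
SHUCK  <- 0.15 * VOLUME * exp(rnorm(500, sd = 0.2))

raw_fit <- lm(SHUCK ~ VOLUME)
log_fit <- lm(log10(SHUCK) ~ log10(VOLUME))

# Ratio of residual spread in the upper half of VOLUME to the lower half;
# a value near 1 indicates roughly constant variance
upper <- VOLUME > median(VOLUME)
spread_ratio <- function(res) sd(res[upper]) / sd(res[!upper])

ratio.raw <- spread_ratio(resid(raw_fit))
ratio.log <- spread_ratio(resid(log_fit))
print(round(c(raw = ratio.raw, log10 = ratio.log), 2))
```

On the raw scale the ratio is well above 1, while on the log10 scale it is close to 1, which is the pattern the two scatterplots show visually.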

#### Section 4: (5 points) ####

(4)(a1) Since abalone growth slows after class A3, infants in classes A4 and A5 are considered mature and candidates for harvest. You are given code in (4)(a1) to reclassify the infants in classes A4 and A5 as ADULTS.

## 
## ADULT     I 
##   747   289
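The reclassification code supplied for (4)(a1) is not shown in this report. A hypothetical equivalent, on a seven-row toy data frame, could look like this:

```r
toy <- data.frame(
  CLASS = factor(c("A1", "A2", "A3", "A4", "A5", "A4", "A2"), levels = paste0("A", 1:5)),
  TYPE  = factor(c("I", "I", "ADULT", "I", "I", "ADULT", "ADULT"))
)

before <- table(toy$TYPE)
# Every individual in A4 or A5 is treated as an adult, whatever its original TYPE
toy$TYPE[toy$CLASS %in% c("A4", "A5")] <- "ADULT"
after <- table(toy$TYPE)

print(rbind(before, after))
```

In the report's own counts the same move shifts 40 individuals, from 707 adults and 329 infants to 747 adults and 289 infants.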

(4)(a2) Regress L_SHUCK as the dependent variable on L_VOLUME, CLASS and TYPE (Kabacoff Section 8.2.4, p. 178-186, the Data Analysis Video #2 and Black Section 14.2). Use the multiple regression model: L_SHUCK ~ L_VOLUME + CLASS + TYPE. Apply summary() to the model object to produce results.

## 
## Call:
## lm(formula = L_SHUCK ~ L_VOLUME + CLASS + TYPE, data = mydata)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.270634 -0.054287  0.000159  0.055986  0.309718 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.796418   0.021718 -36.672  < 2e-16 ***
## L_VOLUME     0.999303   0.010262  97.377  < 2e-16 ***
## CLASSA2     -0.018005   0.011005  -1.636 0.102124    
## CLASSA3     -0.047310   0.012474  -3.793 0.000158 ***
## CLASSA4     -0.075782   0.014056  -5.391 8.67e-08 ***
## CLASSA5     -0.117119   0.014131  -8.288 3.56e-16 ***
## TYPEI       -0.021093   0.007688  -2.744 0.006180 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08297 on 1029 degrees of freedom
## Multiple R-squared:  0.9504, Adjusted R-squared:  0.9501 
## F-statistic:  3287 on 6 and 1029 DF,  p-value: < 2.2e-16

Essay Question: Interpret the trend in CLASS level coefficient estimates. (Hint: this question is not asking if the estimates are statistically significant. It is asking for an interpretation of the pattern in these coefficients, and how this pattern relates to the earlier displays).

***Answer: ( The coefficient estimates for CLASS, each measured relative to the A1 baseline, are: CLASS A2: -0.018, CLASS A3: -0.047, CLASS A4: -0.076, CLASS A5: -0.117.

This shows a clear, decreasing trend in the coefficients as CLASS increases from A2 to A5. In practical terms, this means that for a given volume (size), the meat weight (shuck weight) tends to decrease as the abalone matures through higher CLASS levels.

This pattern aligns with biological observations that abalone growth, particularly in terms of meat accumulation, slows after CLASS A3. As a result, CLASS A4 and A5, though physically larger, tend to accumulate less meat per unit of volume and are therefore considered mature and suitable for harvest.

The negative and increasingly large coefficients reflect this slowing growth, indicating that meat yield becomes progressively less efficient with increasing CLASS, even when the overall size (VOLUME) is the same.

)***

Additional Essay Question: Is TYPE an important predictor in this regression? (Hint: This question is not asking if TYPE is statistically significant, but rather how it compares to the other independent variables in terms of its contribution to predictions of L_SHUCK for harvesting decisions.) Explain your conclusion.

***Answer: (Although the variable TYPE is statistically significant (p = 0.0062), the magnitude of its coefficient (–0.021) is relatively small compared to other predictors—particularly L_VOLUME and the CLASS levels (e.g., CLASS A5 has a coefficient of –0.117).

This suggests that TYPE does contribute to predicting L_SHUCK, but its practical impact is limited. In terms of harvesting decisions, the volume (L_VOLUME) and growth class (CLASS) provide much stronger and more meaningful signals about the expected meat yield (log shuck weight) than TYPE does.

In other words, while TYPE does affect L_SHUCK to a small extent, it is not a primary driver of harvesting decisions when compared to the much larger and clearer effects of VOLUME and CLASS. Therefore, TYPE is less important than the other variables for making harvesting decisions based on meat yield.

)***
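Because the response is log10(SHUCK), each coefficient can also be read on the original scale as a multiplicative factor, 10^coefficient, at a fixed L_VOLUME. The sketch below applies that back-transformation to the estimates reported in the regression summary; the vector names are introduced here only for illustration.

```r
# Coefficient estimates copied from the regression summary (log10 scale)
est <- c(CLASSA2 = -0.018005, CLASSA3 = -0.047310, CLASSA4 = -0.075782,
         CLASSA5 = -0.117119, TYPEI = -0.021093)

# At a fixed L_VOLUME, a coefficient b multiplies SHUCK by 10^b
# relative to the baseline level (CLASS A1, TYPE ADULT)
factor.vs.baseline <- 10^est
pct.change <- 100 * (factor.vs.baseline - 1)

print(round(cbind(estimate = est, factor = factor.vs.baseline, pct = pct.change), 3))
```

Read this way, an A5 abalone is expected to carry roughly 24% less shuck weight than an A1 abalone of the same volume, while the TYPE effect is a little under 5%.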


The next two analysis steps involve an analysis of the residuals resulting from the regression model in (4)(a) (Kabacoff Section 8.2.4, p. 178-186, the Data Analysis Video #2).


#### Section 5: (5 points) ####

(5)(a) If “model” is the regression object, use model$residuals and construct a histogram and QQ plot. Compute the skewness and kurtosis. Be aware that with ‘rockchalk,’ the kurtosis value has 3.0 subtracted from it which differs from the ‘moments’ package.

## 
## Attaching package: 'moments'
## The following objects are masked from 'package:rockchalk':
## 
##     kurtosis, skewness
## Skewness: -0.0595
## Kurtosis: 3.3498

(5)(b) Plot the residuals versus L_VOLUME, coloring the data points by CLASS and, a second time, coloring the data points by TYPE. Keep in mind the y-axis and x-axis may be disproportionate which will amplify the variability in the residuals. Present boxplots of the residuals differentiated by CLASS and TYPE (These four plots can be conveniently presented on one page using par(mfrow..) or grid.arrange()). Test the homogeneity of variance of the residuals across classes using bartlett.test() (Kabacoff Section 9.3.2, p. 222).

## 
##  Bartlett test of homogeneity of variances
## 
## data:  residuals by CLASS
## Bartlett's K-squared = 3.6882, df = 4, p-value = 0.4498

Essay Question: What is revealed by the displays and calculations in (5)(a) and (5)(b)? Does the model ‘fit’? Does this analysis indicate that L_VOLUME, and ultimately VOLUME, might be useful for harvesting decisions? Discuss.

***Answer: (The model appears to fit the data well. The residuals are approximately normally distributed, as seen in the histogram and Q-Q plot. The skewness (-0.06) is close to zero, and the kurtosis (3.35) is near the expected value of 3 for a normal distribution. These values suggest that the residuals are symmetric and follow a normal shape, which is an important assumption for regression models to produce reliable estimates.

Residuals are also consistently spread across the levels of CLASS and TYPE, as shown in the boxplots. For CLASS, Bartlett’s test was used to check homogeneity of variances. The p-value from the test is 0.4498, which is much greater than 0.05. This means we fail to reject the null hypothesis of equal variances, so there is no evidence that the spread of the model’s prediction errors differs across classes. This stability is important because it means the model performs consistently, regardless of the class of abalone.

Because the model satisfies both the normality and equal variance assumptions, we can trust its predictions. Since VOLUME was a key predictor in this model, and the model fits well, this implies that VOLUME is a useful variable for predicting SHUCKED WEIGHT. In practical terms, this means VOLUME can help in harvesting decisions, such as selecting abalones that are likely to yield more meat.

)***
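The numerical part of the residual checks in (5)(a) and (5)(b) can be sketched in base R as below. The data are synthetic and the model is deliberately simpler than the one in (4)(a2); `toy`, `res`, `skew` and `kurt` are hypothetical names, and the kurtosis here is the unadjusted one (3 for a normal distribution).

```r
set.seed(3)
n <- 500
toy <- data.frame(
  L_VOLUME = runif(n, min = 1, max = 3),
  CLASS    = factor(sample(paste0("A", 1:5), n, replace = TRUE))
)
toy$L_SHUCK <- -0.8 + 1.0 * toy$L_VOLUME + rnorm(n, sd = 0.08)  # made-up relationship

model <- lm(L_SHUCK ~ L_VOLUME + CLASS, data = toy)
res   <- model$residuals

# Moment-based skewness and unadjusted kurtosis; residuals from a model
# with an intercept have mean zero, so raw moments can be used directly
m2   <- mean(res^2)
skew <- mean(res^3) / m2^1.5
kurt <- mean(res^4) / m2^2

# Homogeneity of residual variance across classes
bt <- bartlett.test(res, toy$CLASS)

print(round(c(skewness = skew, kurtosis = kurt, bartlett.p = bt$p.value), 4))
```

`hist(res)` and `qqnorm(res); qqline(res)` give the visual counterparts of these numbers.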


Harvest Strategy:

There is a tradeoff faced in managing abalone harvest. The infant population must be protected since it represents future harvests. On the other hand, the harvest should be designed to be efficient with a yield to justify the effort. This assignment will use VOLUME to form binary decision rules to guide harvesting. If VOLUME is below a “cutoff” (i.e. a specified volume), that individual will not be harvested. If above, it will be harvested. Different rules are possible. Management needs to choose one rule to implement that meets the business goal.

The next steps in the assignment will require consideration of the proportions of infants and adults harvested at different cutoffs. For this, similar “for-loops” will be used to compute the harvest proportions. These loops must use the same values for the constants min.v and delta and use the same statement “for(k in 1:10000).” Otherwise, the resulting infant and adult proportions cannot be directly compared and plotted as requested. Note the example code supplied below.


#### Section 6: (5 points) ####

(6)(a) A series of volumes covering the range from minimum to maximum abalone volume will be used in a “for loop” to determine how the harvest proportions change as the “cutoff” changes. Code for doing this is provided.
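The supplied loop is not reproduced in this report. The sketch below is one plausible form of it, run on synthetic volumes: `min.v`, `delta`, `prop.infants`, `prop.adults` and `volume.value` are the names used in the text, while `max.v`, `vol.infants`, `vol.adults` and the simulated distributions are assumptions of this example.

```r
set.seed(5)
# Synthetic stand-ins for the abalone volumes (not the real data)
vol.infants <- rlnorm(300, meanlog = log(130), sdlog = 0.6)
vol.adults  <- pmax(rnorm(700, mean = 450, sd = 150), 20)
VOLUME <- c(vol.infants, vol.adults)

min.v <- min(VOLUME)
max.v <- max(VOLUME)
delta <- (max.v - min.v) / 10000

prop.infants <- numeric(10000)
prop.adults  <- numeric(10000)
volume.value <- numeric(10000)

for (k in 1:10000) {
  value <- min.v + k * delta
  volume.value[k] <- value
  # Proportion NOT harvested: volume at or below the candidate cutoff
  prop.infants[k] <- mean(vol.infants <= value)
  prop.adults[k]  <- mean(vol.adults <= value)
}

print(head(data.frame(volume.value, prop.infants, prop.adults)))
```

Keeping `min.v`, `delta` and the loop range identical for both groups is what puts the two proportion vectors on the same `volume.value` grid, so they can be compared element by element.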

(6)(b) Our first “rule” will be protection of all infants. We want to find a volume cutoff that protects all infants, but gives us the largest possible harvest of adults. We can achieve this by using the volume of the largest infant as our cutoff. You are given code below to identify the largest infant VOLUME and to return the proportion of adults harvested by using this cutoff. You will need to modify this latter code to return the proportion of infants harvested using this cutoff. Remember that we will harvest any individual with VOLUME greater than our cutoff.

## [1] 526.6383
## [1] 0.2476573
## [1] 0

(6)(c) Our next approaches will look at what happens when we use the median infant and adult harvest VOLUMEs. Using the median VOLUMEs as our cutoffs will give us (roughly) 50% harvests. We need to identify the median volumes and calculate the resulting infant and adult harvest proportions for both.

## [1] 384.5584
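The three rules in (6)(b) and (6)(c) can be computed with a single "count" helper, as in this sketch on synthetic volumes. The helper `harvest()`, the vector names and the simulated data are assumptions made for illustration.

```r
set.seed(5)
# Synthetic stand-ins for the abalone volumes (not the real data)
vol.infants <- rlnorm(300, meanlog = log(130), sdlog = 0.6)
vol.adults  <- pmax(rnorm(700, mean = 450, sd = 150), 20)

# Proportion of a group harvested: VOLUME strictly greater than the cutoff
harvest <- function(v, cutoff) mean(v > cutoff)

cutoffs <- c("protect all infants" = max(vol.infants),
             "median infant"       = median(vol.infants),
             "median adult"        = median(vol.adults))

rules <- data.frame(
  cutoff  = unname(cutoffs),
  infants = sapply(unname(cutoffs), function(cc) harvest(vol.infants, cc)),
  adults  = sapply(unname(cutoffs), function(cc) harvest(vol.adults, cc)),
  row.names = names(cutoffs)
)
print(round(rules, 4))
```

By construction the largest infant volume harvests no infants, and each group's own median harvests half of that group.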

(6)(d) Next, we will create a plot showing the infant conserved proportions (i.e. “not harvested,” the prop.infants vector) and the adult conserved proportions (i.e. prop.adults) as functions of volume.value. We will add vertical A-B lines and text annotations for the three (3) “rules” considered, thus far: “protect all infants,” “median infant” and “median adult.” Your plot will have two (2) curves - one (1) representing infant and one (1) representing adult proportions as functions of volume.value - and three (3) A-B lines representing the cutoffs determined in (6)(b) and (6)(c).

Essay Question: The two 50% “median” values serve a descriptive purpose illustrating the difference between the populations. What do these values suggest regarding possible cutoffs for harvesting?

***Answer: (The two 50% “median” values highlight the difference in volume distributions between infants and adults. The median infant volume is much lower than the median adult volume, indicating that, in general, adults tend to have larger volumes than infants.

This separation suggests that there is a range of possible cutoff values between the two medians where: - A large portion of adults can still be harvested, and - A majority of infants can be protected.

Therefore, the medians demonstrate that it is feasible to set a volume cutoff that achieves a balance between biological conservation and economic yield. Cutoffs between the two medians could serve as practical options for harvest strategies that are both efficient and sustainable. )***


More harvest strategies:

This part will address the determination of a cutoff volume.value corresponding to the observed maximum difference in harvest percentages of adults and infants. In other words, we want to find the volume value such that the vertical distance between the infant curve and the adult curve is maximum. To calculate this result, the vectors of proportions from item (6) must be used. These proportions must be converted from “not harvested” to “harvested” proportions by using (1 - prop.infants) for infants, and (1 - prop.adults) for adults. The reason the proportion for infants drops sooner than adults is that infants are maturing and becoming adults with larger volumes.

Note on ROC:

There are multiple packages that have been developed to create ROC curves. However, these packages, and the functions they define, expect to see predicted and observed classification vectors. From those predictions, the functions calculate the true positive rates (TPR), false positive rates (FPR) and other classification performance metrics. They are worthwhile, and you will certainly encounter them if you work in R on classification problems. In this case, however, we already have vectors with the TPRs and FPRs. Our adult harvest proportion vector, (1 - prop.adults), is our TPR. This is the proportion, at each possible ‘rule’ (i.e. at each hypothetical harvest threshold, or element of volume.value), of individuals we will correctly identify as adults and harvest. Our FPR is the infant harvest proportion vector, (1 - prop.infants): at each possible harvest threshold, what is the proportion of infants we will mistakenly harvest? We can think of TPR as the Confidence level (i.e. 1 - Probability of Type I error) and FPR as the Probability of Type II error. Our ROC curve, then, is created by plotting (1 - prop.adults) as a function of (1 - prop.infants). In short, how much more ‘right’ can we be (moving upward on the y-axis) if we’re willing to be increasingly wrong, i.e. harvest some proportion of infants (moving right on the x-axis)?


#### Section 7: (10 points) ####

(7)(a) Evaluate a plot of the difference ((1 - prop.adults) - (1 - prop.infants)) versus volume.value. Compare to the 50% “split” points determined in (6)(c). There is considerable variability present in the peak area of this plot. The observed “peak” difference may not be the best representation of the data. One solution is to smooth the data to determine a more representative estimate of the maximum difference.

(7)(b) Since curve smoothing is not studied in this course, code is supplied below. Execute the following code to create a smoothed curve to append to the plot in (a). The procedure is to individually smooth (1-prop.adults) and (1-prop.infants) before determining an estimate of the maximum difference.

(7)(c) Present a plot of the difference ((1 - prop.adults) - (1 - prop.infants)) versus volume.value with the variable smooth.difference superimposed. Determine the volume.value corresponding to the maximum smoothed difference (Hint: use which.max()). Show the estimated peak location corresponding to the cutoff determined.

Include, side-by-side, the plot from (6)(d) but with a fourth vertical A-B line added. That line should intercept the x-axis at the “max difference” volume determined from the smoothed curve here.

(7)(d) What separate harvest proportions for infants and adults would result if this cutoff is used? Show the separate harvest proportions. We will actually calculate these proportions in two ways: first, by ‘indexing’ and returning the appropriate element of the (1 - prop.adults) and (1 - prop.infants) vectors, and second, by simply counting the number of adults and infants with VOLUME greater than the volume threshold of interest.

Code for calculating the adult harvest proportion using both approaches is provided.

## [1] 0.7416332
## [1] 0.7416332
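Steps (7)(b) through (7)(d) can be sketched end to end on synthetic volumes. The smoothing code supplied with the assignment is not shown in this report, so the use of `loess()` with `span = 0.25` here is an assumption, as are the simulated data and every name other than `prop.infants`, `prop.adults`, `volume.value` and `smooth.difference`.

```r
set.seed(5)
# Synthetic stand-ins for the abalone volumes (not the real data)
vol.infants <- rlnorm(300, meanlog = log(130), sdlog = 0.6)
vol.adults  <- pmax(rnorm(700, mean = 450, sd = 150), 20)
VOLUME <- c(vol.infants, vol.adults)

volume.value <- min(VOLUME) + (1:10000) * (max(VOLUME) - min(VOLUME)) / 10000
# Vectorised equivalent of the for-loop: ecdf(v)(x) is the proportion of v <= x
prop.infants <- ecdf(vol.infants)(volume.value)
prop.adults  <- ecdf(vol.adults)(volume.value)

harvest.adults  <- 1 - prop.adults
harvest.infants <- 1 - prop.infants
difference <- harvest.adults - harvest.infants

# Smooth each harvested-proportion curve separately, then take the difference
smooth.adults  <- predict(loess(harvest.adults ~ volume.value, span = 0.25))
smooth.infants <- predict(loess(harvest.infants ~ volume.value, span = 0.25))
smooth.difference <- smooth.adults - smooth.infants

k.max  <- which.max(smooth.difference)
cutoff <- volume.value[k.max]

# Two routes to the adult harvest proportion at that cutoff
by.index <- harvest.adults[k.max]
by.count <- sum(vol.adults > cutoff) / length(vol.adults)
print(c(cutoff = cutoff, by.index = by.index, by.count = by.count))
```

The "index" and "count" values agree because element `k.max` of the proportion vector was itself computed by counting volumes against `volume.value[k.max]`.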

There are alternative ways to determine cutoffs. Two such cutoffs are described below.


#### Section 8: (10 points) ####

(8)(a) Harvesting of infants in CLASS “A1” must be minimized. The smallest volume.value cutoff that produces a zero harvest of infants from CLASS “A1” may be used as a baseline for comparison with larger cutoffs. Any smaller cutoff would result in harvesting infants from CLASS “A1.”

Compute this cutoff, and the proportions of infants and adults with VOLUME exceeding this cutoff. Code for determining this cutoff is provided. Show these proportions. You may use either the ‘indexing’ or ‘count’ approach, or both.

## Cutoff to protect all CLASS A1 infants: 206.786
## Infants harvested (count approach): 0.2872
## Adults harvested (count approach): 0.826
## Infants harvested (index approach): 0.2872
## Adults harvested (index approach): 0.826
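The "zero A1 infants" cutoff in (8)(a) is the first grid value above the largest CLASS A1 infant. The supplied code is not shown in this report, so the following is a hypothetical sketch on a made-up data frame.

```r
set.seed(9)
# Toy data: all A1 individuals are infants; other classes are a mix
toy <- data.frame(
  VOLUME = c(runif(50, 10, 200), runif(100, 50, 500), runif(200, 100, 900)),
  CLASS  = rep(c("A1", "A2", "A3"), times = c(50, 100, 200)),
  TYPE   = c(rep("I", 50), sample(c("I", "ADULT"), 300, replace = TRUE))
)

min.v <- min(toy$VOLUME)
delta <- (max(toy$VOLUME) - min.v) / 10000
volume.value <- min.v + (1:10000) * delta

a1.max <- max(toy$VOLUME[toy$CLASS == "A1" & toy$TYPE == "I"])
# Smallest grid value that leaves every A1 infant unharvested
cutoff <- volume.value[volume.value > a1.max][1]

# 'Count' approach for the two harvest proportions
infants.harvested <- mean(toy$VOLUME[toy$TYPE == "I"] > cutoff)
adults.harvested  <- mean(toy$VOLUME[toy$TYPE == "ADULT"] > cutoff)
print(round(c(cutoff = cutoff, infants = infants.harvested, adults = adults.harvested), 4))
```

Infants from other classes can still exceed this cutoff, which is why the infant harvest proportion is not zero under this rule.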

(8)(b) Next, append one (1) more vertical A-B line to our (6)(d) graph. This time, show the “zero A1 infants” cutoff from (8)(a). This graph should now have five (5) A-B lines: “protect all infants,” “median infant,” “median adult,” “max difference” and “zero A1 infants.”

#### Section 9: (5 points) ####

(9)(a) Construct an ROC curve by plotting (1 - prop.adults) versus (1 - prop.infants). Each point which appears corresponds to a particular volume.value. Show the location of the cutoffs determined in (6), (7) and (8) on this plot and label each.

(9)(b) Numerically integrate the area under the ROC curve and report your result. This is most easily done with the auc() function from the “flux” package. Areas-under-curve, or AUCs, greater than 0.8 are taken to indicate good discrimination potential.

## Area Under ROC Curve (AUC): 0.8667
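The area under the ROC curve can also be approximated without an extra package. The sketch below uses a plain trapezoidal rule in base R as a stand-in for the package function named above, on synthetic volumes; the data and all names other than `volume.value` are assumptions of this example.

```r
set.seed(5)
# Synthetic stand-ins for the abalone volumes (not the real data)
vol.infants <- rlnorm(300, meanlog = log(130), sdlog = 0.6)
vol.adults  <- pmax(rnorm(700, mean = 450, sd = 150), 20)
VOLUME <- c(vol.infants, vol.adults)
volume.value <- min(VOLUME) + (1:10000) * (max(VOLUME) - min(VOLUME)) / 10000

tpr <- 1 - ecdf(vol.adults)(volume.value)   # adults harvested at each cutoff
fpr <- 1 - ecdf(vol.infants)(volume.value)  # infants harvested at each cutoff

# Order the points along the curve and pin the end points (0,0) and (1,1)
ord <- order(fpr, tpr)
x <- c(0, fpr[ord], 1)
y <- c(0, tpr[ord], 1)

# Trapezoidal rule: sum of width times average height
auc.trapezoid <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

# Cross-check: the AUC is the chance that a random adult is larger than a random infant
auc.pairs <- mean(outer(vol.adults, vol.infants, ">"))

print(round(c(trapezoid = auc.trapezoid, pairs = auc.pairs), 4))
```

The pairwise cross-check gives a direct reading of the AUC: it is the probability that VOLUME ranks a randomly chosen adult above a randomly chosen infant.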

#### Section 10: (10 points) ####

(10)(a) Prepare a table showing each cutoff along with the following: 1) true positive rate (1 - prop.adults), 2) false positive rate (1 - prop.infants), and 3) harvest proportion of the total population.

To calculate the total harvest proportions, you can use the ‘count’ approach, but ignoring TYPE; simply count the number of individuals (i.e. rows) with VOLUME greater than a given threshold and divide by the total number of individuals in our dataset.

##                                Strategy Cutoff True_Positive_Rate
## Protect All Infants Protect All Infants 526.64             0.2477
## Median Infant             Median Infant 133.82             0.9331
## Median Adult               Median Adult 384.56             0.4993
## Max Difference           Max Difference 262.14             0.7416
## Zero A1 Infants         Zero A1 Infants 206.79             0.8260
##                     False_Positive_Rate Total_Harvest_Proportion
## Protect All Infants              0.0000                   0.1786
## Median Infant                    0.4983                   0.8118
## Median Adult                     0.0242                   0.3668
## Max Difference                   0.1765                   0.5840
## Zero A1 Infants                  0.2872                   0.6757
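A table of this shape could be assembled along the following lines. This is a sketch under stated assumptions: the data frame `dat`, the two named cutoffs and the helper `summarise.cutoff` are hypothetical, and all three rates use the ‘count’ approach.

```r
# Tiny hypothetical data set (NOT the abalone data).
dat <- data.frame(
  TYPE   = c("I", "I", "I", "I",
             "ADULT", "ADULT", "ADULT", "ADULT", "ADULT", "ADULT"),
  VOLUME = c(40, 90, 150, 260, 120, 210, 300, 380, 450, 600)
)

# Hypothetical cutoffs; in the assignment these would be the five computed cutoffs.
cutoffs <- c("Cutoff A" = 100, "Cutoff B" = 280)

# For one cutoff, return the three rates using the 'count' approach.
summarise.cutoff <- function(cutoff, data) {
  adult <- data$TYPE == "ADULT"
  c(Cutoff                   = cutoff,
    True_Positive_Rate       = mean(data$VOLUME[adult] > cutoff),   # adults harvested
    False_Positive_Rate      = mean(data$VOLUME[!adult] > cutoff),  # infants harvested
    Total_Harvest_Proportion = mean(data$VOLUME > cutoff))          # TYPE ignored
}

# One row per cutoff, one column per statistic.
cutoff.table <- as.data.frame(t(sapply(cutoffs, summarise.cutoff, data = dat)))
print(round(cutoff.table, 4))
```

The total harvest proportion is a weighted average of the two rates, with weights equal to the shares of adults and infants in the data, so it always lies between the false positive rate and the true positive rate.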

Essay Question: Based on the ROC curve, it is evident a wide range of possible “cutoffs” exist. Compare and discuss the five cutoffs determined in this assignment.

***Answer: (The ROC curve illustrates the trade-off between adult yield and infant conservation across various volume-based cutoff strategies. Each point on the curve represents a potential rule, balancing the True Positive Rate (TPR) – adults correctly harvested – against the False Positive Rate (FPR) – infants mistakenly harvested.

The five cutoffs analyzed in this assignment represent key decision strategies with distinct implications:

  1. Protect All Infants (TPR ≈ 0.25, FPR = 0.00): This is the most conservative strategy, ensuring zero infant harvest. However, it sacrifices yield by harvesting only about 25% of adults. It is optimal for strict biological conservation but not efficient from a yield perspective.
  2. Median Adult Volume (TPR ≈ 0.50, FPR ≈ 0.024): This cutoff captures half of the adult population while mistakenly harvesting only about 2.4% of infants. It offers a balanced approach, allowing reasonable yield while largely preserving infants.
  3. Max Difference (TPR ≈ 0.74, FPR ≈ 0.18): This point corresponds to the maximum vertical gap between the TPR and FPR curves, indicating the most efficient tradeoff between yield and conservation. It is arguably the best overall strategy, balancing adult harvest with moderate infant loss.
  4. Zero A1 Infants (TPR ≈ 0.83, FPR ≈ 0.29): This strategy ensures complete protection of A1-class infants, who may be biologically or commercially significant. While some other infants are harvested, it maintains a high adult yield and prioritizes protecting the most vulnerable subgroup.
  5. Median Infant Volume (TPR ≈ 0.93, FPR ≈ 0.50): This cutoff harvests about 93% of adults, but at the cost of harvesting 50% of infants. Although economically efficient, it poses a serious risk to future sustainability, making it biologically inadvisable.

Conclusion

The ROC curve reveals that a wide spectrum of strategies exists. For maximum long-term sustainability, the “Protect All Infants” rule is safest. However, “Max Difference” emerges as the most effective compromise, providing strong yield while still protecting a significant portion of the infant population. If further protection of specific vulnerable groups (e.g., CLASS A1) is a priority, the “Zero A1 Infants” rule is a practical alternative with high efficiency and targeted conservation.

)***

Final Essay Question: Assume you are expected to make a presentation of your analysis to the investigators. How would you do so? Consider the following in your answer:

  1. Would you make a specific recommendation or outline various choices and tradeoffs?
  2. What qualifications or limitations would you present regarding your analysis?
  3. If it is necessary to proceed based on the current analysis, what suggestions would you have for implementation of a cutoff?
  4. What suggestions would you have for planning future abalone studies of this type?

***Answer: (If I were expected to present this analysis to the investigators, I would approach it in a structured and strategic manner, focusing on both the data-driven insights and the practical implications of the findings.

  1. Recommendation vs. Choices and Tradeoffs

Rather than prescribing a single recommendation outright, I would outline the five volume-based cutoff strategies analyzed and present their associated tradeoffs using the ROC curve. This visual comparison highlights:

• Strategies like “Protect All Infants”, which offer maximum biological protection but lower yield.
• Strategies like “Max Difference”, which balance economic efficiency with conservation.
• Riskier strategies like “Median Infant”, which yield high returns but compromise sustainability.

This way, the decision-makers can align the strategy with their management objectives, whether conservation-focused, yield-maximizing, or a compromise.

  2. Qualifications and Limitations of the Analysis

I would be transparent about the limitations of the current analysis:

• The model is based on historical volume and type classifications, which may not fully reflect real-world harvesting variability (e.g., measurement errors, environmental shifts).
• The analysis assumes static growth patterns and fixed classifications (e.g., TYPE and CLASS), while in reality, these may evolve over time.
• Socioeconomic or regulatory factors (like legal size limits or economic cost of overharvest) are not directly considered in the model.

These caveats are important for responsible interpretation and cautious policy formation.

  3. Recommendations for Cutoff Implementation

If a decision must be made based on this analysis, I would recommend “Max Difference” as the initial implementation cutoff. This rule:

• Harvests a high proportion of adults (~74%)
• Keeps infant harvest moderate (~18%)
• Aligns well with both conservation and operational goals

I would suggest using this rule with periodic monitoring, and possibly integrating A1-infant protection if population studies indicate vulnerability in that group.

  4. Suggestions for Future Abalone Studies

To improve future analyses and decision-making, I would suggest the following:

• Incorporate longitudinal data, including growth and reproduction rates, to model long-term sustainability.
• Collect and analyze environmental and habitat variables that may affect abalone size and classification.
• Include economic variables, such as market value by size/type, to integrate cost-benefit considerations into harvest planning.
• Consider individual tagging or tracking (if feasible) to study transitions from infant to adult classifications and refine model accuracy.
• Develop adaptive management simulations to evaluate dynamic strategies that can adjust cutoffs based on observed outcomes.

Summary

My presentation would focus on informed tradeoff choices, acknowledge uncertainties, and provide practical implementation suggestions. I would also emphasize the need for ongoing data collection and adaptive management to refine harvesting strategies over time. )***